Background
This project looks at Covid-19 vaccinations administered over time, by
manufacturer. Covid-19 has affected people worldwide, and with media
attention heavily focused on vaccination uptake and on the potential
benefits and risks of the different vaccines offered, I was interested
to see how many vaccinations had actually been administered over time,
and which manufacturers' vaccines were used. The data is a Kaggle data
set, available from [https://www.kaggle.com/code/fit4kz/covid-immunizations-analysis-in-r/data],
and is plotted and visualised in several interactive line graph
formats.
The following steps provide a guide to achieving this, with code
provided for use in RStudio. Various packages are used and may need to
be installed before running the code, and file paths to local drives
may need to be specified for accessing the data and saving plots,
scripts and codebooks.
The package ‘here’ has been used within the code to try to minimise
these issues by building file paths relative to the project folder the
user is working from.
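As a minimal sketch of how ‘here’ is used in this project: the folder names "data" and "output" are the ones that appear in the code later on, and creating the output folder up front is my own suggestion rather than part of the original scripts.

```r
library(here)

# Build paths relative to the project root instead of hard-coding a drive.
csv_path <- here("data", "vaccinations.csv")
output_dir <- here("output")

# Create the output folder if it is missing, so later save calls
# have somewhere to write to.
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

csv_path
```

Printing csv_path shows the full path that read.csv() will be given on the current machine.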
Aims
The data set lends itself to investigating several outcomes. I was interested in finding the total vaccinations administered worldwide to date, the total vaccinations administered by country, to see which country had the largest vaccination uptake, and finally the total vaccinations administered over time by manufacturer, to see which vaccine had been administered most worldwide to date.
Load and view data
First we need to load all the packages used. Any that are missing can be installed with the install.packages() function.
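If it is unclear which packages are already installed, a check along these lines can be run first. This is only a sketch: find_missing_packages() is a hypothetical helper written for this example, and the install step is left commented out so nothing is installed by accident.

```r
# Packages loaded by this project (taken from the library() calls below).
required_packages <- c(
  "tidyverse", "here", "ggplot2", "gganimate", "gifski", "png",
  "dplyr", "plotly", "highcharter", "codebook", "future"
)

# Hypothetical helper: return the packages that are not yet installed.
find_missing_packages <- function(packages) {
  setdiff(packages, rownames(installed.packages()))
}

missing_packages <- find_missing_packages(required_packages)

if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to install them
}
```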
#load packages
library(tidyverse)
library(here)
library(ggplot2)
library(gganimate)
library(gifski)
library(png)
library(dplyr)
library(plotly)
library(highcharter)
library(codebook)
library(future)
Next we need to import the data
#load data
vaccinations <- read.csv(here("data", "vaccinations.csv"))
Now we should take a look at the data itself
head(vaccinations)
## location date vaccine total_vaccinations
## 1 Austria 2021-01-08 Johnson&Johnson 0
## 2 Austria 2021-01-08 Moderna 0
## 3 Austria 2021-01-08 Novavax 0
## 4 Austria 2021-01-08 Oxford/AstraZeneca 0
## 5 Austria 2021-01-08 Pfizer/BioNTech 31641
## 6 Austria 2021-01-15 Johnson&Johnson 0
We can see there are four variables: location, date, vaccine and total_vaccinations. Using the code below, we can also see the class R has assigned to each variable
#view data set
glimpse(vaccinations)
## Rows: 31,127
## Columns: 4
## $ location <chr> "Austria", "Austria", "Austria", "Austria", "Austri~
## $ date <chr> "2021-01-08", "2021-01-08", "2021-01-08", "2021-01-~
## $ vaccine <chr> "Johnson&Johnson", "Moderna", "Novavax", "Oxford/As~
## $ total_vaccinations <int> 0, 0, 0, 0, 31641, 0, 99, 0, 0, 117141, 0, 343, 0, ~
Here we can see that location, date and vaccine are classed as character and total_vaccinations is classed as integer. We need to format some of the options and data so R knows how to correctly interpret certain values within the data set.
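To illustrate why the class matters, the sketch below takes two date strings in the same format as the date column and compares them before and after conversion. The variable names are made up for this example.

```r
# The same two dates, first as plain text and then as Date objects.
dates_chr <- c("2021-01-08", "2021-01-15")
dates_date <- as.Date(dates_chr)

class(dates_chr)
class(dates_date)

# Date arithmetic only works once the values are real dates.
days_between <- as.numeric(dates_date[2] - dates_date[1])
days_between
```

Once converted, the two dates are seven days apart, whereas the character versions are just text and cannot be subtracted.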
Preparing the data
Converting the date variable to the Date class ensures it is interpreted as a date by R
#create a dataframe with the date column stored as a Date
vaccinations.formatted <- vaccinations %>%
  mutate(date = as.Date(date))
#express totals in millions, as the raw values are very large
vaccinations.formatted <- vaccinations.formatted %>%
  mutate(total_vaccinations = total_vaccinations / 1000000)
#discourage scientific notation when values are printed or plotted
options(scipen = 999)
Checking for duplicates ensures that no rows are counted twice, which would skew the results.
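As a small illustration on made-up rows (not taken from the data set), anyDuplicated() returns 0 when no row is repeated and otherwise the index of the first repeated row:

```r
# Made-up rows for illustration; the third row repeats the first.
toy <- data.frame(
  location = c("Austria", "Austria", "Austria"),
  date = c("2021-01-08", "2021-01-15", "2021-01-08"),
  total_vaccinations = c(0, 99, 0)
)

# Index of the first repeated row, or 0 if there is none.
first_duplicate <- anyDuplicated(toy)
no_duplicates <- anyDuplicated(toy[1:2, ])

# unique() drops repeated rows if any are found.
toy_unique <- unique(toy)
```

Here first_duplicate is 3 and no_duplicates is 0, and toy_unique keeps only the first two rows.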
#check for duplicates in the data set
anyDuplicated(vaccinations.formatted)
#There are no duplicates in this data set
Creating dataframes to answer aims
#find total vaccinations administered worldwide (in millions, after the rescaling above)
vaccinations_worldwide <- vaccinations.formatted %>%
  summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE))
vaccinations_worldwide
## total_immunizations
## 1 469784.6
#find total vaccinations by country
vaccinations_by_country <- vaccinations.formatted %>%
  group_by(location) %>%
  summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(total_immunizations))
vaccinations_by_country
## # A tibble: 42 x 2
## location total_immunizations
## <chr> <dbl>
## 1 European Union 174635.
## 2 United States 127696.
## 3 Germany 35207.
## 4 France 29067.
## 5 Italy 25612.
## 6 South Korea 19142.
## 7 Chile 9275.
## 8 Peru 9158.
## 9 Ecuador 4599.
## 10 Ukraine 4358.
## # ... with 32 more rows
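One caveat, stated as an assumption rather than a checked fact: if total_vaccinations is a running total for each country and vaccine (the rising Austrian figures in the preview suggest this, but the source should be consulted), then summing every row counts the same doses again on each later date. The location column also mixes an aggregate, European Union, with individual member countries. The sketch below uses made-up rows to show one way of taking only the latest value per location and vaccine before adding up.

```r
library(dplyr)

# Made-up running totals (in doses) for one country and two vaccines.
toy <- data.frame(
  location = "Austria",
  date = as.Date(c("2021-01-08", "2021-01-15", "2021-01-08", "2021-01-15")),
  vaccine = c("Moderna", "Moderna", "Pfizer/BioNTech", "Pfizer/BioNTech"),
  total_vaccinations = c(0, 99, 31641, 117141)
)

# Summing every row counts earlier doses again on each later date.
sum_of_rows <- sum(toy$total_vaccinations)

# Assumption: total_vaccinations is cumulative, so keep only the most
# recent row for each location and vaccine before adding up.
latest_totals <- toy %>%
  group_by(location, vaccine) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()

doses_to_date <- sum(latest_totals$total_vaccinations)
```

In this toy example sum_of_rows is 148881, while doses_to_date is 117240, which is the figure that matches the latest running totals.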
Creating a plot
We now need to create the basic plot using the ggplot2 package
#first create a dataframe of total vaccinations for each date and manufacturer
vaccinations_by_manufacturer <- vaccinations.formatted %>%
  drop_na() %>%
  group_by(date, vaccine) %>%
  summarise(total_vaccinations = sum(total_vaccinations), .groups = 'drop')
Now that we have our dataframe we can create our plot using the ggplot2 and gganimate packages
#create a basic plot showing total vaccinations over time, grouped by vaccine
#add labels and a theme to customise the plot
#animate the plot over time using gganimate
p1 <- ggplot(vaccinations_by_manufacturer, aes(x = date, y = total_vaccinations, group = vaccine, color = vaccine)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations (millions)") +
  xlab("Date") +
  theme_classic() +
  transition_reveal(date)
#view plot
p1
#save plot as a gif
anim_save(here("output", "p1gganimate.gif"))
We can also save this plot as a static picture by saving it as a .png file
#build the same plot without the animation and save it as a .png file
p1 <- ggplot(vaccinations_by_manufacturer, aes(x = date, y = total_vaccinations, group = vaccine, color = vaccine)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations (millions)") +
  xlab("Date") +
  theme_classic()
ggsave(here("output", "p1ggplot.png"), plot = p1)
Adding plotly allows us to make the plot interactive, with a tooltip that shows the total vaccinations by manufacturer on a given date when the cursor hovers over a data point. This allows for a more in-depth look at the data, while still providing a clean and easy-to-read overview.
#remove animation and add plotly
p2 <- ggplot(vaccinations_by_manufacturer, aes(x = date, y = total_vaccinations, group = vaccine, color = vaccine)) +
  geom_line() +
  geom_point() +
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations (millions)") +
  xlab("Date") +
  theme_classic()
p2 <- ggplotly(p2)
p2
Although the plot can be downloaded as a png when viewed, saving the graph as an HTML file creates a single file containing all the JavaScript and CSS dependencies it needs, so it can be viewed in a browser and is still interactive.
#save plotly as HTML file
htmlwidgets::saveWidget(p2, here("output", "p2plotly.html"))
Highcharter
I like the interactive element of the plotly graph, and the plot could be tidied and explored further in ggplot2. Another package, highcharter, allows for further visual finesse, with a range of themes and effects to choose from that increase the impact of the plot. Highcharter has an excellent tooltip with many customization options, which we use in the code below. Highcharter is an R wrapper for Highcharts, a charting library written in pure JavaScript, which makes adding interactive charts to web sites or web applications easier.
#plotting the data in a similar way using highcharter
#the x axis shows the date and the y axis shows the vaccination totals
p3 <- vaccinations_by_manufacturer %>%
  hchart('line', hcaes(x = date, y = total_vaccinations, group = vaccine)) %>%
  hc_title(text = 'Covid 19 vaccinations over time by manufacturer') %>%
  hc_subtitle(text = 'Number of administered vaccinations for Covid 19 over time and by each manufacturer') %>%
  hc_xAxis(title = list(text = 'Date')) %>%
  hc_yAxis(title = list(text = 'Total administered vaccinations (millions)')) %>%
  hc_tooltip(crosshairs = TRUE, sort = TRUE, borderWidth = 6, table = TRUE, shared = TRUE) %>%
  hc_add_theme(hc_theme_ffx()) %>%
  hc_exporting(enabled = TRUE, filename = "plots/vaccinations-by-manufacturer-highcharter")
#view plot
p3
Enabling exporting adds an export menu to the chart, so the plot can be downloaded whenever it is viewed. The plot itself is saved as an HTML file in the same way as the previous plotly graph
#save highcharter plot
htmlwidgets::saveWidget(p3, here("output", "p3highcharter.html"))
Codebook
Finally, we create a codebook that uses metadata to give a technical overview of the dataframe used to create the plots, specifying the values that R attributes to each variable. When viewed, the codebook is saved in a subfolder called figures as .pdf files
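A codebook is more informative when each variable carries a description. The sketch below stores a description on each column of a made-up stand-in dataframe as a "label" attribute, using base R only. It assumes the codebook package picks up labels stored this way (the convention used for labelled data), and the label texts are my own wording.

```r
# Toy stand-in for the plotting dataframe (made-up values).
toy <- data.frame(
  date = as.Date(c("2021-01-08", "2021-01-15")),
  vaccine = c("Moderna", "Moderna"),
  total_vaccinations = c(0, 0.000099)
)

# Store a human-readable description on each column as a "label" attribute.
attr(toy$date, "label") <- "Reporting date"
attr(toy$vaccine, "label") <- "Vaccine manufacturer"
attr(toy$total_vaccinations, "label") <- "Vaccinations administered (millions)"

# Collect the labels to check what was stored.
column_labels <- vapply(toy, function(x) attr(x, "label"), character(1))
column_labels
```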
#create and view a codebook
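#store the result under a new name so the codebook() function is not masked
vaccinations_codebook <- codebook(vaccinations_by_manufacturer)
vaccinations_codebook
Discussion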
I found this data easy to work with to answer my aims, and it was from a reliable source. The source contained only the original data file, which was transferred into this project after unzipping.
I have found this project a steep learning curve, but enjoyable. I
have learnt so much about the use of RStudio for visualizing data and
presenting research in an open format online.
I think one of the most challenging aspects of the project was adapting
code to suit my needs. I had very clear ideas about what I wanted to do
and the functions I needed to get there, but when tailoring those
functions I generally ran into errors that required me to gain a better
understanding of the function before I could amend the code correctly
and run it error free. Although displaying only one plot may have
produced more impact, I felt it was important that users could see what
the animated and interactive plots would look like without having to
run the code or search the output folder, and that the plots showed a
logical progression of the work, hence showing three of the plots
within the published markdown.